The application of machine learning approaches to determine the predictors of anemia among under five children in Ethiopia

Health professionals need a strong prediction system to reach an appropriate disease diagnosis, particularly for under-five children with health problems such as anemia. Delays in diagnosis and treatment can lead to devastating complications and result in childhood mortality. Applying machine learning techniques to a large data set can provide scientifically sound information to help address such critical health and health-related problems. Therefore, this study aimed to determine the predictors of anemia among under-five children in Ethiopia using a machine learning approach. A cross-sectional study was conducted using the 2016 Ethiopian Demographic and Health Survey data set, for which samples were selected with a two-stage stratified cluster sampling technique. The data analysis was conducted using the Statistical Package for Social Sciences (SPSS) version 25 and R software. The Boruta algorithm was applied to select the features and determine the predictors of anemia. The feature importance analysis showed that number of children, distance to health facility, health insurance coverage, youngest child's stool disposal, residence, mothers' wealth index, type of cooking fuel, number of family members, mothers' educational status and receiving rotavirus vaccine were the top ten important predictors of anemia among under-five children. Of these, the most significant predictors were number of children, distance to health facility, health insurance coverage, youngest child's stool disposal, residence, mothers' wealth index, and type of cooking fuel.
Machine learning models can play an important role in informing policy and intervention strategies for the prevention and control of anemia among under-five children.


Sample size estimation and sampling techniques
The survey was conducted from January 18 to June 27, 2016. The information for this study came from the Ethiopian Demographic and Health Survey (EDHS), which was completed in 2016. The 2016 EDHS was the fourth survey conducted in Ethiopia as part of the worldwide Demographic and Health Surveys program. The Central Statistical Agency (CSA) collected the data at the request of the Federal Ministry of Health (FMoH). Data were gathered using a standardized, previously validated questionnaire. During the EDHS interviews, interviewers used tablet computers to record responses. The tablets had Bluetooth technology installed to support remote electronic file transfer (transferring assignment sheets from team supervisors to interviewers and completed copies

Study variables and measurements
The outcome variable was the presence or absence of anemia in a child under the age of five, coded as zero for the absence of anemia and one for its presence. A hemoglobin or hematocrit value was required for a child to be identified as having anemia; these were measured based on mothers' complaints regarding the symptoms of these illnesses 24.
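The binary coding of the outcome can be illustrated with a minimal sketch. The text does not state the hemoglobin cutoff that was applied, so the 11.0 g/dL threshold below (the WHO cutoff for children aged 6 to 59 months) is an assumption, and the function name and sample values are hypothetical.

```python
from typing import Optional

# Assumption: WHO cutoff for children aged 6-59 months; the paper does not state its cutoff.
ANEMIA_CUTOFF_G_DL = 11.0

def code_anemia(hemoglobin_g_dl: Optional[float],
                cutoff: float = ANEMIA_CUTOFF_G_DL) -> Optional[int]:
    """Return 1 for anemia, 0 for no anemia, and None when hemoglobin is missing."""
    if hemoglobin_g_dl is None:
        return None
    return 1 if hemoglobin_g_dl < cutoff else 0

# Hypothetical hemoglobin values in g/dL, including one missing measurement.
hemoglobin_values = [12.3, 10.4, None, 11.0]
outcome = [code_anemia(h) for h in hemoglobin_values]
```

A value exactly at the cutoff is coded as non-anemic here because the comparison is strict; a survey-specific adjustment such as altitude correction is not modeled.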

Data quality control
The data for this study were obtained from the 2016 EDHS secondary data. The data were extracted by strictly following the required procedures, as shown in Fig. 1. Data cleaning was done carefully. The quality of the EDHS 2016 data was determined primarily by the quality of the fieldwork, and appropriate steps during data processing can enhance it considerably. Data entry and editing for inconsistencies were critical steps in this research to remove missing data. Each relevant data pre-processing step was performed to assure the quality of the data.

Data processing
There are 25 features and 9501 instances in the extracted data set. Data preprocessing techniques such as data cleaning, data transformation, handling class imbalance, and feature selection were used because not all of these features are pertinent for creating a model that can predict anemia among children under the age of five in Ethiopia. Mode imputation for categorical data was used to fill in the missing values. Redundant data were removed manually. Features with many categories, such as the source of drinking water, body mass index, wealth index, parents' occupation, and the type of fuel used, were transformed into discrete values using binning discretization. These features have multiple distinct values and need to be transformed for mining purposes. Then, the essential features for the following steps were chosen using the Boruta algorithm feature selection approach (Fig. 3).
A total of 9501 instances with 14 attributes were taken into consideration for further analysis and prediction model construction after completing all necessary data preprocessing activities.
Figure 1. Sampling procedure for under-five children's anemia prediction analysis.
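The mode imputation and binning discretization steps described in this section can be sketched as follows. This is an illustration only, not the authors' SPSS or R workflow: the function names, the fuel categories, and the body mass index cut points are hypothetical.

```python
from collections import Counter

def impute_mode(values):
    """Replace None entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing observed, so nothing to impute from
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def bin_value(value, edges, labels):
    """Return the label of the first bin whose upper edge is >= value.

    `labels` must hold one more entry than `edges`; the last label
    covers everything above the final edge.
    """
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]

# Hypothetical categorical feature with one missing value.
cooking_fuel = ["wood", "wood", None, "charcoal"]
cooking_fuel_filled = impute_mode(cooking_fuel)

# Hypothetical cut points for discretizing body mass index.
bmi_values = [17.0, 22.4, 31.0]
bmi_binned = [bin_value(b, [18.5, 25.0], ["low", "normal", "high"]) for b in bmi_values]
```

When two categories are equally frequent, `Counter.most_common` returns the one seen first, so a real analysis would need an explicit tie-breaking rule.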

Ethical approval and consent to participate
The researchers received the approval letter for the survey data from the USAID DHS program after registering at https://www.dhsprogram.com/data/dataset_admin/login_main.cfm. We obtained an authorization letter from ICF to use these data and have attached the letter as an annex. The study does not require ethical approval because it was a secondary data analysis using the 2016 EDHS database. After receiving the data from the USAID DHS program, the researchers maintained the confidentiality, privacy, and anonymity of the data. During the survey, informed consent was obtained from the study participants before the start of the study. All methods were carried out in accordance with relevant guidelines and regulations.

Descriptive results of the socio-demographic characteristics
Of the 9501 study participants, 51.1% were male and nearly 81% of the respondents were living in a rural area. About 64% of the participants had no educational background at all. Nearly 37% of the respondents were in the poorest category of the family wealth index. Around 60% of the mothers were not working. The largest share of the respondents, 21.1%, were in the 48-59-month age category (Table 1).

Nutritional and co-morbid characteristics among under-five children
Among the total participants, nearly 96% of the children were breastfed and 8532 (89.8%) had no history of diarrhea. However, 5613 (59%) of the children had not taken a vitamin A supplement and the majority, 8393 (88.3%), had not taken drugs for intestinal parasites in the previous six months. Regarding nutritional status, 922 (9.7%) of the respondents were wasted and 3997 (42.1%) were stunted (Table 3). The prevalence of anemia among under-five children in Ethiopia was nearly 40% according to the Ethiopian Demographic and Health Survey data (Fig. 2).

Feature selection and determination of anemia's predictors among under-five children
The selection of features is an important phase in predictive modeling 25,26. It becomes especially significant when a data set with a large number of variables is provided for model construction. For this study, we used the Boruta feature selection algorithm, a method commonly applied when the aim is to understand the mechanisms related to the variable of interest (Fig. 3). In Fig. 3, the boxplots are sorted by increasing importance. Variables colored green were classified as relevant and confirmed by the algorithm, and variables colored red were classified as irrelevant and rejected. The blue boxplots represent the shadow attributes, which the Boruta algorithm creates as a benchmark against which variable importance is compared in order to classify each variable as relevant, tentative, or irrelevant.
Using the Boruta feature selection method, fourteen of the twenty-eight variables were selected as important features for model construction. Number of children, distance to health facility, health insurance coverage, youngest child's stool disposal, residence, mothers' wealth index, mothers' educational status, occupation, and type of toilet facility were some of the variables found to be significant for model building (Fig. 3). The variable description codes for the Boruta algorithm are shown in Table 4.
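The shadow-attribute idea behind this kind of feature selection can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure used in the study: the Boruta algorithm scores variables with random forest importance and applies a statistical test over many iterations, whereas the sketch below swaps in absolute Pearson correlation as the importance score and only counts how often a variable beats the best shadow. All names and the toy data are hypothetical.

```python
import math
import random

def abs_correlation(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_x * var_y)

def shadow_feature_hits(features, outcome, n_rounds=20, seed=0):
    """Count, per feature, the rounds in which it outscores the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_rounds):
        # Shadow features are shuffled copies of the real columns, so any
        # association they show with the outcome is due to chance alone.
        best_shadow = 0.0
        for column in features.values():
            shadow = list(column)
            rng.shuffle(shadow)
            best_shadow = max(best_shadow, abs_correlation(shadow, outcome))
        for name, column in features.items():
            if abs_correlation(column, outcome) > best_shadow:
                hits[name] += 1
    return hits

# Toy data: "signal" equals the outcome, "noise" is uncorrelated with it.
outcome = [0, 1] * 100
features = {
    "signal": list(outcome),
    "noise": [0, 0, 1, 1] * 50,
}
hits = shadow_feature_hits(features, outcome)
```

In this toy setup the variable that matches the outcome should beat the shadows in every round, while the uncorrelated one should never do so; a variable with an intermediate hit count corresponds to the "tentative" category described above.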

Discussion
This study briefly described the prevalence of anemia and its predictors among under-five children in Ethiopia using machine learning techniques. The model showed that number of children, distance to health facility, health insurance coverage, youngest child's stool disposal, residence, mothers' wealth index, type of cooking fuel, number of family members, mothers' educational status and receiving rotavirus vaccine were the top ten important predictors of anemia among under-five children. The results of the ML model seem to be nearly in line with those of the conventional logistic regression analysis. Both indicate that factors such as distance to health facility, health insurance coverage, youngest child's stool disposal, residence, mothers' wealth index, type of toilet facility, number of family members, and mothers' educational status play a significant role in the anemia level of children under the age of five in Ethiopia. Only receiving the rotavirus vaccine appears as a relevant factor in the ML model but not in the usual logistic regression analysis. This suggests that ML models could surface "different variables" or previously unknown insights compared with conventional regression models, which could be essential in policy decision-making.
The findings of this study demonstrated a substantial relationship between anemia and health insurance coverage. This finding is in line with various studies conducted in Ghana [27][28][29], where children who are not insured are at a higher risk of developing anemia than insured children. This is because health insurance serves as a strategy for accessing health services for health problems, including anemia, as easily as possible. Hence, uninsured households are at risk of financial shocks from health service costs, which leads to treatment delay and further progression of health problems, including anemia.
www.nature.com/scientificreports/
Surprisingly, there is a significant association between anemia and the type of cooking fuel. According to the findings of this study, children who are exposed to unclean cooking fuel and smoking are more likely to develop anemia than their counterparts. This finding is supported by various studies conducted in Georgia State University 30, India 31 and a survey of sub-Saharan countries 32. This may be because exposure to cooking fuel could lead to systemic inflammation mediated by inflammatory cytokines.
Similarly, the findings of this study revealed that household wealth status is a significant predictor of under-five child anemia. The occurrence of anemia among under-five children from rich households is lower than among children from poor households. This finding is in line with studies conducted in different locations of Ethiopia: Filtu town in the Somali region 33, southwest Ethiopia 34, Wolaita 35, Kombolcha 36, South Wollo 37, the northeast 38, and University of Gondar 39. It is also in agreement with studies conducted in Rwanda 40, Bahir Dar University 41, and Sudan 42. The possible justification could be food scarcity, poor hygiene and sanitation, and poor childcare resulting in malnutrition, including iron deficiency anemia.
According to this study, there is a significant relationship between anemia and receiving the rotavirus vaccine among under-five children. Children who did not receive the rotavirus vaccine are more susceptible to developing anemia than those who received it. This is because children who do not receive the rotavirus vaccine are at a higher risk of developing diarrhea, which could result in anemia 43. Diarrhea could suppress children's immunity, leaving them vulnerable to other health problems that result in nutritional deficiency, including iron deficiency anemia.
Additionally, this study pointed out a significant association between the youngest child's stool disposal and under-five children's anemia. Children whose stool is disposed of improperly are more likely to develop anemia than their counterparts. This finding is in agreement with a study conducted in Tanzania 44. This may be due to the exposure of children to helminths from improperly disposed stools, which increases the chance of transmission of helminthic diseases such as hookworm. These can lead to low food absorption, poor appetite, gastrointestinal bleeding, and other complications, ultimately resulting in anemia.
On the other hand, this study showed a substantial association between anemia and the number of family members and children: children living in households with a large family size and many children are at a higher risk of developing anemia than children living in small families. This is in line with studies conducted in Southwest Ethiopia 34 and University of Gondar 39. This may be because children living with many other children and family members might compete for food and are more susceptible to communicable diseases, both of which lead to nutritional deficiencies, particularly iron deficiency anemia.
Under-five children's anemia was also shown to be predicted by the mother's educational level. This is in line with past findings from South Wollo 37, Jimma University 39, University of Gondar 39, Uganda 21, sub-Saharan countries 32, Malawi 4, Rwanda 40, Togo 45 and Palestine 46. This could be because education enables mothers to manage their surroundings, including healthcare facilities, collaborate with medical experts more successfully, adhere to treatment recommendations, and maintain a clean environment. Furthermore, women with more education have more influence over the health decisions that affect their children.
Likewise, residence is significantly associated with under-five children's anemia according to the findings of this study: children living in rural areas are at a higher risk of developing anemia than their counterparts. This finding is supported by studies conducted in southwest Ethiopia 34, Sudan 42, Brazil 47,48, and Bolivia 49. This may be because low economic status, lack of iron-rich foods, lack of information about balanced dietary intake, and a high rate of illiteracy could be linked with the occurrence of anemia among under-five children.
Moreover, the distance to the health facility predicted the occurrence of anemia among under-five children according to the findings of this study. Children who live farther from their health facility are at a higher risk of developing anemia. This is in line with a study conducted in University of Gondar 50. This may be because children living far from a health facility might have inadequate health-seeking behavior, which leads to delayed treatment for common childhood illnesses and eventually results in anemia following complicated health problems.
Even though their importance is lower than that of the other variables, mothers' occupation, toilet facility, and antiretroviral infection were also identified as predictors of anemia among under-five children. Finally, the ML findings are less interpretable than those of classical regression models since they lack regression coefficients and, consequently, a direction of effect. In practice, ML models rank or classify factors according to how much they contributed to predicting the anemia level among children under the age of five in the current research. The direction of effect of these factors may be ascertained from the available empirical literature based on conventional approaches. Nevertheless, machine learning methods are considered extremely helpful in predicting the determinants of population health and other phenomena, which can improve policy decisions 51,52.

Strengths and limitations of the study
The use of a large, nationwide sample is the key strength of this study, as it allows the results to be generalized to the general population. The application of an advanced machine learning prediction algorithm was another strength of the study. The limitation of this study was the restriction of researchers just to the

Figure 2. Prevalence of anemia among under-five children in Ethiopia.

Table 3. Nutritional and co-morbid characteristics of anemia among under-five children in Ethiopia from January 18 to June 27, 2016 (N = 9501). ARI, antiretroviral infection.

Table 4. Variable description codes for the Boruta algorithm figure (Fig. 3).